
\documentclass[11pt]{article} % use larger type; default would be 10pt
\usepackage{framed}
\usepackage[utf8]{inputenc} % set input encoding (not needed with XeLaTeX)
\usepackage{geometry} % to change the page dimensions
\geometry{a4paper} % or letterpaper (US) or a5paper or....
% \geometry{margin=2in} % for example, change the margins to 2 inches all round
% \geometry{landscape} % set up the page for landscape
%   read geometry.pdf for detailed page layout information

\usepackage{graphicx} % support the \includegraphics command and options

% \usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent

%%% PACKAGES
\usepackage{booktabs} % for much better looking tables
\usepackage{array} % for better arrays (eg matrices) in maths
\usepackage{paralist} % very flexible & customisable lists (eg. enumerate/itemize, etc.)
\usepackage{verbatim} % adds environment for commenting out blocks of text & for better verbatim
\usepackage{subfig} % make it possible to include more than one captioned figure/table in a single float
% These packages are all incorporated in the memoir class to one degree or another...
\usepackage{framed}

%%% HEADERS & FOOTERS
\usepackage{fancyhdr} % This should be set AFTER setting up the page geometry
\pagestyle{fancy} % options: empty , plain , fancy
\renewcommand{\headrulewidth}{0pt} % customise the layout...
\lhead{}\chead{}\rhead{}
\lfoot{}\cfoot{\thepage}\rfoot{}

%%% SECTION TITLE APPEARANCE
\usepackage{sectsty}
\allsectionsfont{\sffamily\mdseries\upshape} % (See the fntguide.pdf for font help)
% (This matches ConTeXt defaults)

%%% ToC (table of contents) APPEARANCE
\usepackage[nottoc,notlof,notlot]{tocbibind} % Put the bibliography in the ToC
\usepackage[titles,subfigure]{tocloft} % Alter the style of the Table of Contents
\renewcommand{\cftsecfont}{\rmfamily\mdseries\upshape}
\renewcommand{\cftsecpagefont}{\rmfamily\mdseries\upshape} % No bold!
\begin{document}


\section*{ML Week 10 Large Scale Machine Learning}

Close
Large Scale Machine Learning

5 questions

%===============================================================%
1. 
Suppose you are training a logistic regression classifier using stochastic gradient descent. You find that the cost (say, cost($\theta$,(x(i),y(i))), averaged over the last 500 examples), plotted as a function of the number of iterations, is slowly increasing over time. Which of the following changes are likely to help?

\begin{itemize}
\item CORRECT  Try halving (decreasing) the learning rate $\alpha$, and see if that causes the cost to now consistently go down; and if not, keep halving it until it does.

\item Try averaging the cost over a smaller number of examples (say 250 examples instead of 500) in the plot.

\item This is not possible with stochastic gradient descent, as it is guaranteed to converge to the optimal parameters $\theta$.

\item Use fewer examples from your training set.
\end{itemize}

%===============================================================%
2. 
Which of the following statements about stochastic gradient

descent are true? Check all that apply.

\begin{itemize}
\item WRONG Suppose you are using stochastic gradient descent to train a linear regression classifier. The cost function  is guaranteed to decrease after every iteration of the stochastic gradient descent algorithm.
% - J(θ)=12m∑mi=1(hθ(x(i))−y(i))2
\item 
SELECTED You can use the method of numerical gradient checking to verify that your stochastic gradient descent implementation is bug-free. 
%(One step of stochastic gradient descent computes the partial derivative ∂∂θjcost(θ,(x(i),y(i))).)
\item
Before running stochastic gradient descent, you should randomly shuffle (reorder) the training set.
\item 
SELECTED In order to make sure stochastic gradient descent is converging, we typically compute Jtrain($\theta$) after each iteration (and plot it) in order to make sure that the cost function is generally decreasing.
\end{itemize}
%===============================================================%
3. 
Which of the following statements about online learning are true? Check all that apply.

 \begin{itemize}

\item WRONG One of the advantages of online learning is that there is no need to pick a learning rate $\alpha$.

\item CORRECT In the approach to online learning discussed in the lecture video, we repeatedly get a single training example, take one step of stochastic gradient descent using that example, and then move on to the next example.

\item WRONG One of the disadvantages of online learning is that it requires a large amount of computer memory/disk space to store all the training examples we have seen.

\item CORRECT When using online learning, in each step we get a new example (x,y), perform one step of (essentially stochastic gradient descent) learning on that example, and then discard that example and move on to the next.

\item WRONG One of the disadvantages of online learning is that it requires a large amount of computer memory/disk space to store all the training examples we have seen.

\end{itemize}
%===============================================================%
4. 
Assuming that you have a very large training set, which of the following algorithms do you think can be parallelized using map-reduce and splitting the training set across different machines? Check all that apply.


\begin{itemize}

\item CORRECT A neural network trained using batch gradient descent.

\item WRONG Logistic regression trained using stochastic gradient descent.

\item CORRECT Linear regression trained using batch gradient descent.

\item WRONG An online learning setting, where you repeatedly get a single example (x,y), and want to learn from that single example before moving on.

\item WRONG Computing the average of all the features in your training set %μ=1m∑mi=1x(i) (say in order to perform mean normalization).

\end{itemize}
%===============================================================%
5. 
Which of the following statements about map-reduce are true? Check all that apply.

\begin{itemize}
\item WRONG Running map-reduce over N computers requires that we split the training set into $N^2$ pieces.

\item WRONG In order to parallelize a learning algorithm using map-reduce, the first step is to figure out how to express the main work done by the algorithm as computing sums of functions of training examples.


\item CORRECT  When using map-reduce with gradient descent, we usually use a single machine that accumulates the gradients from each of the map-reduce machines, in order to compute the parameter update for that iteration.


\item CORRECT  If you are have just 1 computer, but your computer has multiple CPUs or multiple cores, then map-reduce might be a viable way to parallelize your learning algorithm.
\end{itemize}
%-----------------%

TRUE Because of network latency and other overheads associated with map-reduce, if we run map-reduce using N computers, we might get less than an N-fld speedup compared to using 1 computer




\newpage

\subsection*{Stochastic Gradient Checking}

\begin{itemize}
	\item In each iteration of the stochastic gradient, the algorithm needs to examine/use only one training example.
	\item Each iteration updates the parameters based on the cost of only one example $ Cost (\theta,(x^{(i)},y^{(i)}))$.
	\item You can use numerical gradient checking to verify if your stochastic gradient descent implementation is bug-free.
\end{itemize}
\newpage
\section*{Machine Learning System Design}

Suppose a massive dataset is available for training a learning algorithm. Training on a lot of data is likely to give good performance when two of the following conditions hold true. Which are the two?
\begin{itemize}
	\item[(i)] We train a learning algorithm with a small number of parameters (that is thus unlikely to overfit).	
	
	\textbf{No} - If the model has a small number of parameters, then it will underfit the large training set and not make good use of all the data.
	%------------------------%
	\item[(ii)] We train a learning algorithm with a large number of parameters (that is able to learn/represent fairly complex functions).	
	
	\textbf{Yes} - You should use a "\textit{\textbf{low bias}}" algorithm with many parameters, as it will be able to make use of the large dataset provided. If the model has too few parameters, it will underfit the large training set.
	%------------------------%
	\item[(iii)] The features x contain sufficient information to predict y accurately. (For example, one way to verify this is if a human expert on the domain can confidently predict y when given only x).	
	
	\textbf{Yes} - It is important that the features contain sufficient information, as otherwise no amount of data can solve a learning problem in which the features do not contain enough information to make an accurate prediction.
	%------------------------%
	\item[(iv)]When we are willing to include high order polynomial features of x (such as $x^2_1$, $x^2_2$, $x_1x_2$, etc.).	
	
	\textbf{No} - As we saw with neural networks, polynomial features can still be insufficient to capture the complexity of the data, especially if the features are very high-dimensional. Instead, you should use a complex model with many parameters to fit to the large training set.
	
\end{itemize}
\newpage



Suppose you are training a logistic regression classifier using stochastic gradient descent. You find that the cost (say, cost(θ,(x(i),y(i))), averaged over the last 500 examples), plotted as a function of the number of iterations, is slowly increasing over time. Which of the following changes are likely to help?

Try using a larger learning rate α.

Try averaging the cost over a larger number of examples (say 1000 examples instead of 500) in the plot.

This is not an issue, as we expect this to occur with stochastic gradient descent.

SELECTED Try using a smaller learning rate α.

%===============================================================%

Which of the following statements about stochastic gradient

descent are true? Check all that apply.

Before running stochastic gradient descent, you should randomly shuffle (reorder) the training set.

In order to make sure stochastic gradient descent is converging, we typically compute Jtrain(θ) after each iteration (and plot it) in order to make sure that the cost function is generally decreasing.

SELECTED One of the advantages of stochastic gradient descent is that it uses parallelization and thus runs much faster than batch gradient descent.

SELECTED If you have a huge training set, then stochastic gradient descent may be much faster than batch gradient descent.

%-----------------------------------------------%

Which of the following statements about online learning are true? Check all that apply.

One of the advantages of online learning is that there is no need to pick a learning rate α.

SELECTED When using online learning, in each step we get a new example (x,y), perform one step of (essentially stochastic gradient descent) learning on that example, and then discard that example and move on to the next.

SELECTED In the approach to online learning discussed in the lecture video, we repeatedly get a single training example, take one step of stochastic gradient descent using that example, and then move on to the next example.

One of the disadvantages of online learning is that it requires a large amount of computer memory/disk space to store all the training examples we have seen.

%-----------------------------------------------%

Assuming that you have a very large training set, which of the

following algorithms do you think can be parallelized using

map-reduce and splitting the training set across different

machines? Check all that apply.

An online learning setting, where you repeatedly get a single example (x,y), and want to learn from that single example before moving on.

Linear regression trained using batch gradient descent.

A neural network trained using batch gradient descent.

Logistic regression trained using stochastic gradient descent.

%-----------------------------------------------%

Which of the following statements about map-reduce are true? Check all that apply.

In order to parallelize a learning algorithm using map-reduce, the first step is to figure out how to express the main work done by the algorithm as computing sums of functions of training examples.

CORRECT If you are have just 1 computer, but your computer has multiple CPUs or multiple cores, then map-reduce might be a viable way to parallelize your learning algorithm.

CORRECT If you have only 1 computer with 1 computing core, then map-reduce is unlikely to help.

CORRECT Because of network latency and other overhead associated with map-reduce, if we run map-reduce using N computers, we might get less than an N-fold speedup compared to using 1 computer.

Linear regression and logistic regression can be parallelized using map-reduce, but not neural network training.

When using map-reduce with gradient descent, we usually use a single machine that accumulates the gradients from each of the map-reduce machines, in order to compute the parameter update for that iteration.
\end{document}
